Augmented Comparative Corpora and Monitoring Corpus in Chinese: LIVAC and Sketch Search Engine Compared

Author

  • Benjamin Ka-Yin T'sou
Abstract

The increasing availability of numerous corpora has significantly contributed to the understanding of words in terms of their underlying semantic structures and lexical networks (e.g. COBUILD, WordNet). Through data mining and information retrieval, research in this area has vastly expanded our appreciation that what constitutes lexical knowledge goes beyond synonymy, hyponymy, metonymy, meronymy, grammatical and other collocations. Furthermore, they are fundamental to a universalistic conceptual base of ontologies and knowledge representation which are often enriched by deeper and newer analysis. In this context, each language foregrounds specific features or nodes within this knowledge base, usually by non-uniform means. At the same time, the arrival of the age of Big Data has attracted extensive studies on the actual and dynamic use of language as contextualized (à la Jakobson 1960) within a given society, especially through the mass media. What is foregrounded in this medium tends to have graded cognitive saliency characterizing members of the common speech community, and such shared knowledge is usually at great variance with the thesaurus approach and shows noticeable localized features. It is proposed here that the two kinds of knowledge (thesauric vs cognitive-cultural) complement each other in human cognition, and are integral to it. We draw on two large Chinese media databases, Sketch (2.1 billion character tokens) and LIVAC (550 million character tokens), for illustration and discussion. The Sketch Engine in Chinese shows how apple is, as expected, primarily related to orange, peach, fruit, vegetable, food etc. At the...

Similar resources

Word sketch lexicography: new perspectives on lexicographic studies of Chinese near synonyms

Comparative study of near synonyms is one of the most productive research paradigms in Chinese lexicography. Empirical studies to discriminate near synonyms are either introspection-based or corpus-based. Yet, due to the large quantity of data in a corpus, lexicological studies of Chinese rarely make full use of the corpus data. To solve this problem, Kilgarriff’s Word Sketch Engine is designed...

Sketching the Dependency Relations of Words in Chinese

We propose a language resource by automatically sketching grammatical relations of words based on dependency parses from untagged texts. The advantage of word sketch based on parsed corpora is, compared to Sketch Engine (Kilgarriff, Rychly, Smrz, & Tugwell, 2004), to provide more details about the different usage of each word such as various types of modification, which is also important in la...

Comparative evaluation of tools for Arabic corpora search and analysis

As the number of Arabic corpora is constantly increasing, there is an obvious and growing need for concordancing software for corpus search and analysis that supports as many features as possible of the Arabic language, and provides users with a greater number of functions. This paper evaluates seven existing corpus search and analysis tools based on eight criteria which seem to be the most ess...

Large Linguistically-Processed Web Corpora for Multiple Languages

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmati...

Comparative Study of the Academic Vocabulary Content of Electronic Engineering Corpora, GE Materials and M.S. Entrance Examinations

The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...

Publication date: 2015